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Abstract 

In high-dimensional data analysis, penalized likelihood estimators are shown to provide 
superior results in both variable selection and parameter estimation. A new algorithm, APPLE, 
is proposed for calculating the Approximate Path for Penalized Likelihood Estimators. Both the 
convex penalty (such as LASSO) and the nonconvex penalty (such as SCAD and MCP) cases 
are considered. The APPLE efficiently computes the solution path for the penalized likelihood 
estimator using a hybrid of the modified predictor-corrector method and the coordinate-descent 
algorithm. APPLE is compared with several well-known packages via simulation and analysis 
of two gene expression data sets. 

1 Introduction 

Variable selection is a vital tool in statistical analysis of high-dimensional data. Typically, 
a large number of potential predictors are included during the first stage of modeling, in 
order to avoid missing important links between a predictor and the outcome. This practice 
has become more popular in recent years for two primary reasons. First, in many recently 
promising fields, such as bioinformatics, genetics and finance, more and more high-throughput 
and high-dimensional data are being generated. Secondly, low cost and easy implementation 
for data collection and storage have made problems for which the number of variables is large, 
in comparison to the sample size, possible to be handled. 

In order to provide more representative and reasonable applications of models in a math- 
ematical framework, we often seek a smaller subset of important variables. T he first attemp t 
to variable s e lectio n was the Ip-type r e gulari zation methods, including AIC (|Akaikel 1 19731 ). 
C p (|Mallowsl . Il973h and BIC l|Schwarzl , |l97Sh . which work well in low-dimensional cases. In 
addition, they also exhibit good sampling properties dBarron et al.|.ll999h . However, searching 
all the possible subsets can be unstable jBreiman and Gad . 1 19961 ) . and in high-dimensional 
settings, the combinatorial problem has NP-complexity, which is computationally prohibitive. 
As a result, numerous attempts have been made to modify the ^o-type regularization to re- 
duce the computa tional burden. The most popu lar penalized regression method is LASSO 
(|Tibshiranil . I1996T ) or equivalently Basis Pursuit (|Chen and Donohol . 1 19941 ). Being a convex 
penalty, it is computationally convenient, but lacks the oracle property and shrinks estimators 
regardless of importance. Hence, some n onconvex penalties have been proposed in o rder to 
yield better performance, such as SCAD (|Fan and Lil. 120011 ) and MCP (|Zhang|. l20lbT ). Also, 
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for generalized linear models (GLM), penalize d likelihood metho ds ha ve been studied for high- 
dimensi onal variable select ion, for example in lFan and Lil (|200ll ) and Ivan de Geerl fcOOSft . We 
refer to lFan and Lvl (|2010h for a review of variable selection in high-dimensionality. 

Throughout the paper, we assume we have i.i.d. observations (xi,yi),i — 1, • • • ,n, where 
Xi is a p-dimensional predictor and yi is the response. We further assume the conditional 
distribution of y given x belongs to an exponential family with canonical link, that is, it has 
the following density function 



f(y,x,f3) = c(y)exp 



yO-m) 



a(</>) 



(1) 



where 9 — x'f3 and cj> £ (0, oo) is the dispersion parameter. 

In view of |T]), the log- likelihood of the sample is given, up to an affine transformation, by 

n 

£(y;0) =n- 1 Y l [Vi0i-b(0 i )]. 

i=i 

Here, we are interested in estimating the p-dimensional vector (3, and the penalized likelihood 
estimator is defined as 

3(A) = argmm{-.e(y;/3) +Pa(/3)}, 

where p\{-) is the penalty function and A > is the regularization parameter. 

Developing an efficient algorithm for calculating the solution path of the coefficient vector 
/3(A), as A varies along a possible set of values, is very desirable. There is a vast literature 
on calculating such a path for penalized linear re gression. For th e convex penal t y LA SSO, 
least angle regression (LARS) (|Efron et all 120041 ). or homotopy |Osborne et all I2000T ) are 
efficient methods for computing the entire path of LASSO solutions in the l inear regression 
case. For nonconvex penalties in cluding SCAD an d MCP, IFan and Lil (|200lh used the local 
quadratic approximation (TO A); IZou and Lil (|2008h proposed the local linear approximation 
(LLA), which makes a local linear approximation to the penalty functi on, thereby y ielding an 
objective function that can be optimized by using the LARS algorithm. IZhand (|2010h proposed 
the penalized linear unbiased selection (PLUS), which is designed for the linear regression 
penalized by quadratic spline penalties, including LASSO, SCAD and MCP. More recently, 
coordinat e descent m ethods have received conside r able attention in high - dimen s ional settings, 
includ i ng IFuI (fl998l). IShevade and Keerthl d2003|). [ Krishnapuram et al.l |2005l ). iGenkin et al.l 
(|2007D . lFriedman et alj (|2007|).I Wu and Langj (I2008T). among ot hers. Other work o n pena lized 



linear regress i on inc ludes Hastie et al.l ( 2004 ). iDaubechies et al.l |2004l 'l , iKim et all (|2007l ') and 
IWei and Zhul |2012l ). 

Different from linear regression, derivatives of the log-likelihood in GLM are changing with 
respect to the regularization parameter A. There has been major res earch on calculating the 
solution path for penalized likelihood estimators in the GLM setting. iPark and Hastie] (|2007l ) 
proposed the glmpath algorithm. They considered the solution (3 as a function of A, and used a 
linear approximation of this function to update the estimator (3. They selected the step length 
in decrea sing A by using an ap proximate smallest length that will change the active set of 
variables. iFriedman et al.1 (|20ld ) proposed a coordinate descent algorithm for penalized GLM, 
in which they quadratically approximate the log-likelihood function and sequentially solve the 
resulting penalized weighted least squares problem on a grid of A values. IWul |2011 ) proposed 



an ordinary differential equation-based solution path algorithm, which used quasi-likelihood 
instead of likelihoo d models, i n order to use LARS more straightforwar dly. However, all the 
numerical results in IWul (|201ll ) are based on the small p large n setting. iBrehenv and Huand 
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J201 if) adopted a coo rdinate descent algorithm for MCP and SCAD penalized GLM. Like 
iFriedman et al.l (|201dl ). they used a quadratic approximation to the log-likelihood part and 
then use d coordinate descent to update the regression parameter estimator. For the MCP 
penalty l|Zhand . I2OI0I ), the tuning parameter 7 is used to adjust the concavity of the penalty. 
The smaller 7 is, the more concave the penalty is, which means finding a global minimizer is 
more difficult; but on the other hand, it results in less bia sed estimators. The tuning parameter 
7 can be changed freely from 1+ to 00. In the GLM case. iBrehenv and Huangl (|201 ll ) proposed 
the adaptive rescaling, which allows the range of the para meter 7 to be as wide a s it can be 
for th e linear regression case. Oth er related papers in clude IZhu and Hastiel l|2004 ). [Lee et al.l 
(|2006l ). lRosset and Zhul |2007l ) and lMeier et""aH |20Qg| ). 

In this work, we propose a new path algorithm, the approximate path for penalized likeli- 
hood estimators (APPLE), under the setting of high-dimensional GLM. Different from linear 
regression, it is often difficult to get explicit solutions in GLM. Taking accuracy and feasibility 
into account, instead of linear approximation of the corresponding change in (3 with the de- 
crease in A, which is used by most of the previous work, we use quadratic approximation to 
get a warm-start in updating. Then targeting on the KKT conditions, we perform a correction 
by op timizing a convex problem. Inspired by the adaptive rescaling in IBrehenv and Huangl 
(|201 if ) . we develop a modified concavity adaptation method for MCP when updating the solu- 
tion, which is shown to have better performance when 7 is small. In this paper, not only path 
algorithms for LASSO penalized GLM are derived, but also path algorithms for nonconvex 
penalized GLM, which have appeared in few of the previous work. Here we mainly focus on 
MCP as an example of nonconvex penalty, but it can be easily extended to other quadratic 
spline penalty functions. 

For LASSO, we detect the active set through the KKT conditions like most of the previous 
work. However, for some nonconvex penalties, such as MCP, by fixing A and the concavity 
parameter 7, the value of the derivative of the penalty function decreases towards zero as 
the absolute value of the estimator increases towards A7. We introduce a modified active set 
detection method, which has not appear in any of the previous work. 

The rest of the paper is organized as follows. In Sections 2 and 3, we introduce the path 
algorithm APPLE for the LASSO and MCP penalties, respectively. We conduct simulation 
studies in Section 4 and two real data examples are presented in Section 5. A short summary 
is given in Section 6, while the technical details for logistic regression and Poisson regression 
are presented in the Appendix. 

2 APPLE with LASSO Penalty 

LASSO (|Tibshiranlll996h is a popular method for regression that uses an ^i-penalty to achieve 
simultaneous variable selection and parameter estimation. The idea has been broadly applied 
in GLM, where the problem is to minimize a convex function. In this section, we describe the 
details of the APPLE algorithm for LASSO penalized GLM. 

2.1 Problem Setup 

Let {(xi, yi), i = 1, • • • , n} be n i.i.d. pairs of p predictors and a response as described 
in the introduction. By adding an additional column of l's to the design matrix X, the 
intercept /3o is absorbed into the coefficient vector (3. We are interested in finding the maximum 
likelihood solution for f3 = (/3o, /3i, • • ■ , ftp)' , with a penalization on the size of the ^i-norm of 
the coefficients excluding the intercept. With a little abuse of notation, we denote ||/3|ji = 
Y2i=i Therefore, the optimization problem for a given A is reduced to finding (3, which 
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minimizes the following: 



L x ((3) = -£(y;p) + \\\(3\\ 1 
= — y>i*09)< - H8(Ph)} + A||/3||i. 



(2) 



As is common in GLM, the function b(ff) is implicitly assumed to be twice continuously 
differentiable with b"(0) always positive. It is straightforward to check that L\(-) is a convex 
function. Therefore, for a given A, the unique minimizer /3(A) is the solution to the KKT 
conditions, which are given as follows. 



01 



Po=0O 



= Asgn(/3 j ) for j = !,-■■ ,p, s.t. ^ 0, 



(3) 



9/3, 



< A for j = 1, • • • ,p, s.t. /3j = 0. 



2.2 Grid of Penalty Parameter 

It is easy to notice from the KKT conditions that when 

A > A max = max \d£/dPj\p i= o\, 
i<j< P 

Pj = for 1 < j < p. As A decreases from A max to 0, (3 — /3(A) will change from (except 
for the intercept /3o) to the MLE solution. However, the full MLE solution has poor predictive 
performance an d lacks the s parsity property because of the high-dimensionality. Here, following 
iFriedman et~afl (|2010h and lBrehenv and Huanj (|201ll ). we set the minimum value of A to be 
A m in = <5A ma x and construct a sequence of K values of A decreasing from A max to A m in on the 
logarithm scale. We denote the sequence of A as Afc, where k = 1, • • • , K. Typical values are 
5 = 0.01 and K = 100. 



2.3 Update 

From the KKT conditions, we can see the relationship between \d£/d/3j\ and A determines 
whether the variable /3j is activated or not. For Afc, we define the active set At as follows, 



A k = i 1 < j < p 



> Afc ^U{0}, 



dp, 

and the step size as Afc = Afc + i — Afc. For a given Afc and active set Ak, we update the active 
coordinates together using the quadratic approximation, 

3™=3^+ S ' fc ».Afc + i d W.A 2 fc, (4) 

where s' fc ' and are the first and second derivatives of ft ' with respect to A, respectively. 
Sinc e the intercept is n ot pe nalized, the first coordinates of and are both 0. Different 
from lPark and Hastiel (|2007l ). where linear approximation was used, the quadratic approxima- 
tion Q is more accurate and the computation time is kept at the same scale. Keep in mind that 
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^(fe+1,0) 

here, the coefficients for the variables outside of Ak are set to be in p . Additionally, 

the approximation Q will typically cause a small deviation from the KKT conditions, which 

makes the following correction step necessary, in order to get the exact solution p at the 
current A^+i. 

Here, we adopt two different correction methods depending on the current model size. To 
be more precise, at step fc, we check the following inequality. 

#{j : 3f +1 ' 0) + 0} < cVTi, (5) 

where c is a user-specified constant. We set c = 1 in all our numerical examples. If ((5| holds 
(i.e., the current solution is relatively sparse compared with the sample size), we use a Newton- 
Raphson correction, otherwise, a coordinate descent correction is applied. When the correction 

method is stopped by a convergence check, the last p is denoted as p 

2.3.1 Newton- Raphson Correction 

^(fc + 1,0) 

Given the current solution p , we use the following Newton-Raphson method to correct 

the estimate until convergence to p , 

_ a(* J) ( <9 2 L (fe) Y 1 f <9L« 
Pa,. - Pa, ■ 



Here, all the acti ve variables are c orrect ed to gether, which is differ e nt fro m coordinate descent 
method used in iFriedman et all (|2010h and iBrehenv and HuanJ (|201ll ). We notice in our 



simulation studies that when (JS| holds, the Newton-Raphson type correction tends to be much 
faster than the coordinate descent correction method. 



2.3.2 Coordinate Descent Correction 

When ((5]) does not hold (i.e., the number of active variables is relatively large), the Newton- 
Raphson method involves inverting a big matrix (d 2 L/df3 2 ), which may become ill-conditioned 
and cause stability issues in the iteration. Therefore, under this scenario, the more stable 
coordinate descent method is applied. In the coordinate descent algorithm, we fix all coefficients 
except /3j, and minimize ([2} for the current A by updating j3j. The process is repeated for 
j — 0, ••• , p. After sweeping through all coordinates, we compare the new solution with 

p . If they are sufficiently close, we have reached convergence; otherwise, the sweeping 

process is repeated until the two most recent estimators are close enough. What makes the 
coordinate descent algorithm particularly attractive is that there is an explicit formula for 
each coordinate upda te. The details for the coordin ate descent algorithm may be found in 
IFriedman et~ai] |20ld ) and lBrehenv and Huanj (|201ll ). 



2.4 Stopping Rules 

•~(ib) 

Following the updating process, we will obtain a solution path p for k = 1, • • ■ , K. However, 
from our simulation results, we notice that in most cases, the solutions near the end of the path 
involve too many spurious variables. Therefore, the following two stopping rules are proposed 
to further speed up the path calculation process. 
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(a) (Model saturation detection). The first rule is designed to terminate the path algorithm 
if the fitting value is too extreme. For example, in logistic regression, if the current 

estimated probability Pi = exp(x' i f3 )/(l + exp(x' i /3 )) satisfies 
max pi > 1 — e or min pi < e, 

i — 1 , ■ ■ ■ ,?i i— 1 , ■ ■ ■ ,n 

where e is a predefined positive constant, we terminate the algorithm. 

(b) (A pre-specified maximum size of the sparsity). In some real applications, the practitioner 
has an upper bound on the size of the sparsity for various reasons. For example, in 
the optimal portfolio allocation problem, one common restriction is the control of the 
transaction costs, which in turn puts a restriction on the maximum number of selected 
stocks. In order to avoid missing the important variables, we usually set the upper limit 
significantly larger than the model size we need. 

Although early stopping is performed following these two rules, the optimal solution always 
occurs before the stopping point in our numerical experience. 

2.5 Summary of the Algorithm 

51. Define the grid of penalty parameters A as {Ai, • • • , Ak}, where Ai = A max , Xk = A m i n = 
5A max , and the remaining ones decrease on the logarithm scale. Set k — 1 and the initial 

estimate /3 = 0. 

52. Calculate the active set by A k = {j : \d£/dfij\ > A fc } U {0}. Denote A fe = A fe+ i - A fe . 
The approximate solution is given by 

3i +1 ' 0) =«+« (fe) -A, + ^-At 

53. Correct the current solution towards the KKT conditions. If (0 holds, we use the Newton- 
Raphson procedure; otherwise, coordinate descent method is adopted. 

54. Check the two stopping rules, if at least one is satisfied, stop the algorithm; otherwise, 
set k — k + 1 and repeat S2-S4 until k = K. 

2.6 Selection of Tuning Parameter 

The performance of penalized likelihood estimators depends heavily on the choice of tuning 
parameters, that is A in LASSO and (A, 7) in MCP. This is usually accomplished through 
cross-validation or by using some information criterion such as AIC, BIC. 

Information criteria derived using asymptotic arguments for the classical regression models 
are usually problemat ic when applied to pen alized regression problems where p^S> n. For high- 
dimensional GLM, in I Chen and Chenl (|2008l ). Extended BIC (EBIC) was proposed by adding 
an extra penalty term on top of BIC. It is defined as, for < 7 < 1, 

EBIC 7 (s) = -2 log L n {9(s)} + ^(s)logn + 27 log 

where s is a subset of {1, • • • ,p}, 9(s) is the parameter 9 with those components outside s 
being set to or some pre-specified values, 8(s) is the maximum likelihood estimator of 9(s), 
and v(s) is the number of components in s. In this paper, we employ both cross-validation 
and EBIC. 
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3 APPLE with MCP Penalty 



Different from LASSO, MCP is a nonconvex penalty which was proposed by Zhang (2010). 
The penalty is a quadratic spline defined on [0, oo) by 

Px,-,(t) = X [\l-x/(<y\))+ dx, (6) 

where the parameter 7 > measures the concavity of the penalty, and A is the regularization 
parameter. The APPLE algorithm for the MCP penalized GLM is slightly different from what 
we proposed in Section 2 for LASSO. Due to the non-convexity of MCP penalty, in this section, 
we will only focus on the main differences from the LASSO case. 

3.1 Problem Setup 

For MCP penalized GLM, the corresponding target function is 



i=l .7 = 1 J ° 7 



)+ dx. 



(7) 



As introduced in IZhand (|201Ch and IZhang and Huand | 2008), the sparse Riesz condition 
(SRC)(c, c* , q) holds under some mild regular conditions. As a result, in the low-dimensional 
manifolds with dimension smaller than q, the convexity of —£(y; (3) can dominate the concavity 
of the penalty, which will lead to the convexity of the target function Q even with the choice 
of nonconvex penalty. Therefore, under the SRC, for estimator with sparsity smaller than q, 
the KKT conditions are still valid to obtain a global minimizer. The KKT conditions are given 
as follows, 







ap l0 o =A) 
at I 

ai I 

ap, \p j =$ j 



A(l - ^)sgn(A0 




for0<|ft|<A 7 , 
for 1/3,1 > A 7 , 
for 4 =0. 



(8) 



3.2 Grid of Penalty Parameter 

The grid {Ai, • • • , Ax} of penalty parameters is identical to that in the LASSO case. 



3.3 Update 

In the LASSO case, the predefined threshold A is constant across all variables. Therefore, 
from the KKT condition (|3]), as long as a variable is activated, it stays in the active set as A 
decreases. But in the MCP case, for the same A, the effective penalty level on each variable 
is different depending on the magnitude of the estimate, as shown in (|8]). Therefore, detecting 
the active set in MCP is different from LASSO. Here we introduce a new active set detection 
method using the KKT conditions, that, to our best knowledge, has not appeared before for 
nonconvex penalties in the literature. For a given Afc, we define the active set At as 

A k = {A fc _ x U jV fc } \ D k , 
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where 



N h = {j €{!,-■■ |0*/dft|>A fc }, 



and 



D k = {j € A k -i n A t 



sgn(/3 (fc ~ 



<0}. 



This means, with decreasing threshold Xk , a particular variable becomes active when it satisfies 
the KKT condition (|SJ). Then the variable will stay activated until it crosses (i.e., the index 
lies in Dk ) , which means the covariates of the estimators have different signs in two consequent 
steps. From our experience, variables which cross at some point in the path are usually 
noise variables. If this deleted variable satisfies the KKT condition (|HJ) along the path later, 
we re-activate it. With decreasing A, the optimization problem ((7} will no longer be convex 
at some point. Therefore, the proposed treatment of deleting variables which cross at some 
point will make the path more stable. 

In accordance with Section 2.3, the step size is defined as Afe = Afe+i — A^. For a given Afe and 
active set Ak, we update the active covariates altogether by using the quadratic approximation, 



S (fcH 

Pa l . 



1,0) _ 3(fe) 



where and are the first and second derivatives of (3 ' with respect to A, respectively. 
Since the intercept is not penalized, the first coordinates of and d^ are both 0. 



Now, we have s 



(k) 



(0 



s^o )'i where s?q 



is defined as 



,(fc) 



- x A k \{o} v<k)x A k \{o} - r 



( 



(9) 



where is given in (|10|) in the Appendix, and V is the population version. T = diag{l/7, ■ • • , I/7}, 
and sgn(-) is the sign func tion of a vector. 

In MCP (Zhang, 2010), the tuning parameter 7 is free to vary from 1+ to 00. For the 
derivative s' fc ' defined in ([9]), particularly in logistic regression, 7 has to be large enough in 
order to make the matrix 



V {k) X 



A k \{0} 



r 



invertible. However, if 7 is too large, the MCP penalty ([6]) is approximately equal to A|t|, which 
is the same as the LAS SO penalty. In that c ase, i t is hard to find the advantages which MCP 
enjoys over LASSO. In lBrehenv and Huand | |201lf ). adaptive rescaling was proposed to solve a 
similar issue. They replaced P\,-y(\Pj\) with Px,i{\vj/3j\), where Vj is the j-th diagonal element 
of the Hessian matrix. Since they used a coordinate descent algorithm, updating coordinates 
one-at-a-time, the rescaled updates are straightforward after this change. But in our algorithm, 
all the active variables are updated together. Therefore, a new adaptation method is needed. 



The adaptation we use is to replace Px,~/(\f3j\) by p ( k ) (|w^ n /3j|), where u^ n is the 



Ik) 



smallest eigenvalue of the matrix — X' 



A fc \{0> 



V 



A, 711 

X A k \{o}- Then, 



- ^ X 'A k \(o } V (k) X AkX{0) -nitr) 1 (-sgn(3S\ {0} )). 



Now, for any 7 > 1, 



±x' AkX{0} v^x 



A k \{0] - w m 



- M ilr>o. 
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Therefore, the singularity problem in ([9]) is avoided for all 7 > 1. 

The correction method we introduced in the LASSO case is based on the fact that the prob- 
lem ([2]) is convex. Here in MCP, although the original problem is not convex, in each step after 
the adaptation, our problem is still convex in a low-dimensional manifold as long as 7 > 1. One 
important issue is that when calculating the first and second order derivatives in the Newton- 

Raphson correction, u^ n is also a function of p (see Appendix). To a void computing implicit 
deriva tives, we use the popular quadratic approximation method (e.g.. iMcCullagh and Neldej 
(|l989l) ) to the negative log-likelihood, which turns out to be very effective. Our new target 
function is 

L(X) = ±-{y-XP)'V{f-Xp) 

In 




)+ dx, 



where y = X/3 + V^ 1 (y — n). The detailed formulations are presented in the Appendix, 
(|ll|l - (|13| l for logistic case and (|14p ~ (|16p for Poisson case. 

The same sparsity criterion ((5]l is used, and the corresponding Newton-Raphson or coordi- 
nate descent method is applied. 

3.4 Stopping Rules 

Stopping rules are the same as the ones discussed in Section 2.4. But different from the LASSO 
case, if a variable is activated in a certain step of the MCP procedure JS), it may turn inactive 
later, and even be activated again later on. So for the same data, we can consider making the 
upper limit (Section 2.4, (b)) in MCP larger than that would be for the LASSO case. 

3.5 Summary of the Algorithm 

The algorithm is the same as in LASSO, except S2 is replaced with S2' described by the 
following. 

S2' Calculate the active set by Ak = {Ak-i U Nk} \ Dk, where 

Nk ={j'6{l,-.P}\ A k -i : \dl/dPi\ > X k } , 

and 

D k = {j G A k -i n A fe - 2 : Bgn($ fc ~ 1) )agn($ fc ~ a) ) < 0}. 

3.6 Selection of Tuning Parameter 

The selection methods are mainly the same as the ones discussed in Section 2.6. But as an 
advantage MCP is shown to possess in our numerical results, the estimators will stay in their 
optimal value for a certain interval of regularization parameter A. This bears some advantages 
in selecting the tuning parameters. 

4 Simulation Results 

Logistic and Poisson regression models are two popular generalized linear models. For each 
model, we present results of LASSO/MCP penalized methods. For the LASSO penalty, we 
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compare APPLE LASSO with the GLMNET package jFriedman et all I2010T ). For MCP, 



we compare APPLE MCP with the NCVREG package (|Brehenv and HuaneL I20T1I ) . Since 



NCVREG only applies to Gaussian and logistic models, no comparable results are presented 
for MCP penalized Poisson model. For each setting, we report results for different tuning pa- 
rameter selection methods, including EBIC and A"-fold CV. The critera include false positives 
(FP), true positives (TP), h loss = ||/3 - /3°||i, £ 2 loss = - f3°\\ 2 . We also compare the 
computational cost with NCVREG for the MCP penalty case. 

4.1 Logistic Regression 

Example 1 We consider a logistic regression model with dimension p = 1000, sample size 
n = 500 and d — 3, where d is the number of nonzero elements in (3° . The first 5 dimensions of 
(3° are (3, 1.5, 0, 0, 2), while the rest are all zeros, /3o = 0. The vector x follows a multivariate 
normal distribution with zero mean and covariance between the i-th and j-th elements being 
p\*-3\ m th p = and p = 0.5 in two different settings. In the simulation, 100 repetitions are 
performed. Except for the dimension and samp le size whic h are increased to 1000 and 500, 
respectively, the settings here are the same as in \Fan and Li 1(200 1\) . 



In Figure 1(a) we compare the solution paths of the APPLE algorithm for MCP and 
LASSO. We can see that the MCP path is less smooth than the LASSO path, but has intervals 
at which the estimators stay constant, and yields a sparser model. With the convergence 
stopping rules adopted in all simulations here, the corresponding solutions of the MCP path are 
sparse even near the end of the path. Actually, the size of active set does not exceed the square 
root of the sample size, which means that the Newton- Raphson correction is used throughout 
the whole path. Notice tha t there is a j u mp o n the LASSO path, which is caused by the change 
of correction method. See iFeng et al.l <|2012h for the stability comparison of various penalty 



functions. In Figure |l(b)| the APPLE and GLMNET LASSO paths are illustrated. Before 
the change point, using the Newton-Raphson correction method, the APPLE path exhibits 
better estimation with a sparser model. After the change point, coordinate descent correction 



is employed, which makes the two paths identical. In Figure 1(c) the APPLE and NCVREG 
MCP paths are compared given the same concavity tuning parameter 7 = 1.3. APPLE paths 
are significantly smoother than NCVREG paths. Although both of these paths stay at the 
"optimal" level, APPLE paths have a longer period, which makes the model selection task 



easier and leads to more stable estimation. In Figure 1(d) APPLE MCP paths with different 
concavity parameters (7 = 1.3,3,100) are presented. This shows that as 7 gets larger, the 
"flat" period of constant optimal magnitude gets shorter, and the APPLE MCP path eventually 



approaches the LASSO path when 7 becomes sufficiently large. In Figure 1(e) we show APPLE 
LASSO paths with different correction methods throughout the whole path. We can see that in 
the beginning of the path when A is large, the differences between these two correction methods 
are negligible. However, as A decreases, more variables are recruited and the Newton-Raphson 
method becomes unstable, making the coefficient estimates "take-off" more quickly compared 
with the coordinate descent method. Therefore, we recommend the hybrid approach of using 
the Newton-Raphson in the first part of the path, and later switch to coordinate descent when 
the number of active variables becomes large enough. 

The FP, TP, £1 loss and £2 loss results for Example 1 are summarized in Table [1] When 
p = 0, comparing the APPLE LASSO and GLMNET, we see the results from EBIC are similar 
for these two methods. However, for the CV, APPLE tends to provide a model with smaller FP 
values while keeping TP the same. When MCP is applied, similar observations can be found. 
Overall, comparing with the existing methods, APPLE does a better job than GLMNET and 
NCVREG in the LASSO and MCP cases, respectively. In addition, for the MCP penalty, 
APPLE provides a smoother path than NCVREG. 
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(a) (b) (c) 




(d) (e) 

Figure 1: Solution paths for logistic regression in Example [TJ where p = 1000, n = 500, d = 3 
and p = 0. In (a), the solid lines and dotted lines are the solution paths for APPLE MCP and 
APPLE LASSO, respectively. In (b), the solid lines and dotted lines are the solution paths for 
APPLE LASSO and GLMNET LASSO, respectively. In (c), the solid lines and dotted lines arc 
for APPLE and NCVREG MCP, respectively. In (d), the solid lines, dashed lines and dotted lines 
are solution paths of APPLE MCP with different 7 values. In (e), the solid lines and dotted lines 
are the solution paths for APPLE LASSO with different correction methods. For each panel and 
each type of lines, the important variables are selected in the same order. As A becomes smaller, 
variables with index 1, 5 and 2 are selected one by one. When A gets smaller than 0.05, noise 
variables are selected. 
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Table 1: Comparisons for APPLE with GLMNET and NCVREG for LASSO and MCP penalties, 
respectively, in Example [1] where p = 1000, n = 500 and d = 3. Design matrices with different 
correlation, and different selection criteria arc presented. The medians of false positive (FP), true 
positive (TP), l\ loss, and £2 loss are reported over 100 repetitions, enclosed in parentheses are the 
corresponding standard errors. 



Model 


Package Method 


FP TP ix loss l 2 loss 


LASSO 
P = {) 


APPLE EBIC 
CV 

GLMNET EBIC 
CV 


0.30(0.54) 3.00(0.00) 3.64(0.28) 4.44(0.70) 
23.51(7.68) 3.00(0.00) 4.93(1.27) 2.52(0.74) 

0.25(0.48) 3.00(0.00) 3.83(0.23) 4.94(0.58) 
48.44(26.02) 3.00(0.00) 4.93(1.27) 2.52(0.74) 


MCP 

p = 0, 7 = 1.3 


APPLE EBIC 
CV 

NCVREG EBIC 
CV 


0.03(0.17) 3.00(0.00) 0.68(0.38) 0.21(0.25) 
0.26(0.03) 3.00(0.00) 0.86(0.96) 0.34(0.67) 
0.65(0.45) 3.00(0.00) 2.50(0.31) 1.51(0.42) 
1.86(3.44) 3.00(0.00) 2.05(0.48) 1.56(0.45) 


MCP 

p = 0,7 = 3 


APPLE EBIC 
CV 

NCVREG EBIC 
CV 


0.03(0.17) 3.00(0.00) 0.68(0.38) 0.21(0.25) 
0.26(1.03) 3.00(0.00) 0.87(0.96) 0.34(0.67) 
0.03(0.17) 3.00(0.00) 0.85(0.47) 0.32(0.36) 
2.65(5.37) 3.00(0.00) 1.04(0.78) 0.30(0.32) 


LASSO 
p = 0.5 


APPLE EBIC 
CV 

GLMNET EBIC 
CV 


0.10(0.30) 3.00(0.00) 3.63(0.27) 4.44(0.70) 
19.20(0.45) 3.00(0.00) 3.72(0.26) 2.24(0.64) 
0.10(0.30) 3.00(0.00) 3.78(0.22) 4.84(0.58) 
43.88(26.02) 3.00(0.00) 4.77(1.27) 2.48(0.74) 


MCP 

p = 0.5, 7 = 1.3 


APPLE EBIC 
CV 

NCVREG EBIC 
CV 


0.02(0.15) 3.00(0.00) 0.66(0.32) 0.18(0.20) 
0.12(0.25) 3.00(0.00) 0.79(0.49) 0.30(0.26) 
0.07(0.26) 2.98(0.15) 0.73(0.44) 0.25(0.44) 
3.33(2.67) 3.00(0.00) 2.84(0.42) 2.77(0.85) 


MCP 

p = 0.5, 7 = 3 


APPLE EBIC 
CV 

NCVREG EBIC 
CV 


0.01(0.10) 3.00(0.00) 0.72(0.34) 0.22(0.22) 
0.14(0.41) 3.00(0.10) 0.87(0.60) 0.34(0.47) 
0.01(0.11) 3.00(0.00) 1.62(0.54) 0.96(0.55) 
4.93(5.60) 3.00(0.00) 1.55(0.97) 0.58(0.52) 
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Tabic 2: Comparison of the computational cost for the APPLE and NCVREG packages in different 
simulation settings. The medians of the computation time (in seconds) are reported, enclosed in 
parentheses are the corresponding standard errors. CPU: Intcl(R) Xcon(R) L5420 @ 2.50GHz. 



Dimension 7 


NCVREG APPLE 
P = Q 


NCVREG APPLE 
p = 0.5 


p = 2 7 1.3 

n = 50 3 


1.33(0.41) 0.13(0.07) 
0.22(0.28) 0.06(0.04) 


1.42(0.52) 0.14(0.06) 
0.15(0.32) 0.07(0.02) 


p = 2 8 1.3 
n = 100 3 


4.09(1.17) 0.37(0.21) 
0.35(0.53) 0.19(0.06) 


5.58(1.29) 0.42(0.11) 
0.32(0.40) 0.27(0.10) 


p = 2 9 1.3 
n = 200 3 


18.11(4.47) 1.15(0.18) 
1.58(0.77) 0.86(0.13) 


27.47(6.25) 1.21(0.20) 
1.32(0.55) 1.08(0.14) 


p = 2 w 1.3 
n = 500 3 


123.53(22.87) 6.29(0.67) 
6.55(1.49) 5.55(0.66) 


186.95(33.00) 6.55(0.80) 
9.01(4.42) 6.03(0.65) 



APPLE is an efficient algorithm for computing the solution path for penalized likelihood 
estimators, particularly for nonconvex penalties. Table [2] illustrates the median time required 
to fit the entire path and the corresponding standard errors of the NCVREG and APPLE 
algorithms. Here, we use the same setting as Example 1 except for the different p and n. In 
all these scenarios, APPLE is faster than NCVREG. The main reason is that for NCVREG, 
all covariates need to be computed in every step and coordinate descent algorithm needs to 
loop over all covariates as well. However, for the APPLE algorithm, in every step, only the 
active set detection needs the loop over all dimensions. Once the active set is calculated, the 
subsequent estimate updates focus only on the variables inside the active set, which has a much 
smaller dimension. Notice the difference is more substantial in the 7 = 1.3 case. 



4.2 Poisson Regression 

Example 2 We consider a Poisson regression model with p = 1000, n = 500 and d — 3, where 
d is the number of nonzero elements of f3° . The first 5 dimensions of (3° are (1.2,0.6,0,0,0.8), 
and the rest are all zeros, Pq = 0. The vector x follows a multivariate normal distribution with 
zero mean and covariance between the i-th and j-th elements being p' 1- -" with p = and p = 0.5 
in two different settings. We repeated each simulation 100 times. Except for the dimension and 
sa mple size wh i ch are increased to 1000 and 500, respectively, the settings here are the same as 
in \Zou and hi \%00&) . 



In Figure 2(a) solution paths for APPLE MCP and LASSO are presented. Similar to the 
logistic regression case, LASSO yields a smoother path, while MCP results in better estimation 
with a nearly "flat" region of optimal level. In addition, at the end of the MCP path, the 
solution is still sparse in terms of our "square root of sample size" criterion. In Figure 2(b) we 
compare APPLE and GLMNET LASSO paths. As in the logistic regression model, there is a 
small jump in the APPLE LASSO path, which is caused by the change of correction method. 
After the correction method switches to the coordinate descent, the APPLE path coincides 



with the GLMNET LASSO path. In Figure 2(c) APPLE MCP paths with different concavity 



parameters are presented. The continuous gradual change with respect to 7 is clear, with paths 
getting smoother and tending to the LASSO path as 7 becomes sufficiently large. In Figure 
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APPLE MCP and LASSO APPLE and GLMNET LASSO 




(a) 



(b) 




(c) (d) 

Figure 2: Solution paths for poisson regression in Example [2] where p — 1000, n = 500, d = 3 and 
(O = 0. In (a), the solid lines and dotted lines arc the solution paths for APPLE MCP and APPLE 
LASSO, respectively. In (b), the solid lines and dotted lines are the solution paths for APPLE 
LASSO and GLMNET LASSO, respectively. In (c), the solid lines, dashed lines and dotted lines 
are solution paths of APPLE MCP with different^ values. In (d), the solid lines and dotted lines 
are the solution paths for APPLE LASSO with different correction methods. For each panel and 
each type of lines, the important variables are selected in the same order. As A becomes smaller, 
variables with index 1, 5 and 2 are selected one by one. When A gets near to zero, noise variables 
are selected. 



Tabic 3: Comparison for APPLE with GLMNET and NCVREG for LASSO and MCP penalties, 
respectively, in Example [5J where p — 1000, n — 500 and d = 3. Design matrices with different 
correlation, and different selection criteria are presented. The medians of false positive (FP), true 
positive (TP), l\ loss, and £2 loss are reported over 100 repetitions, enclosed in parentheses are the 
corresponding standard errors. 



Model 


Package Method 


FP TP li loss £ 2 loss 


LASSO 

P = 


APPLE EBIC 
CV 

GLMNET EBIC 
CV 


0.70(0.93) 3.00(0.00) 0.42(0.16) 0.06(0.04) 
9.41(3.75) 3.00(0.00) 0.33(0.11) 0.02(0.01) 
0.66(1.20) 3.00(0.00) 0.89(0.15) 0.24(0.08) 
28.33(22.61) 3.00(0.00) 0.74(0.33) 0.06(0.03) 


MCP 

p = 0, 7 = 1.3 


APPLE EBIC 
CV 


0.00(0.07) 3.00(0.08) 0.42(0.10) 0.06(0.03) 
0.00(2.76) 3.00(0.26) 0.49(0.27) 0.07(0.05) 


MCP 

p = 0, 7 = 3 


APPLE EBIC 
CV 


0.00(0.18) 3.00(0.62) 0.51(0.17) 0.08(0.05) 
0.00(5.82) 3.00(0.43) 0.49(0.17) 0.05(0.03) 


LASSO 
p = 0.5 


APPLE EBIC 
CV 

GLMNET EBIC 
CV 


0.86(0.95) 3.00(0.00) 0.29(0.11) 0.03(0.03) 
9.57(5.27) 3.00(0.00) 0.26(0.09) 0.01(0.01) 
0.79(1.37) 3.00(0.00) 1.19(0.26) 0.51(0.23) 
25.64(13.92) 3.00(0.00) 0.54(0.17) 0.04(0.02) 


MCP 

p = 0.5, 7 = 1.3 


APPLE EBIC 
CV 


0.00(0.53) 3.00(0.22) 0.16(0.54) 0.00(0.45) 
0.00(2.07) 3.00(0.28) 0.28(0.78) 0.03(0.75) 


MCP 

p= 0.5, 7 = 3 


APPLE EBIC 
CV 


0.00(0.16) 3.00(0.12) 0.16(0.34) 0.00(0.42) 
0.00(3.31) 3.00(0.19) 0.25(0.56) 0.02(0.64) 
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2(d) just as the logistic regression case, Newton-Raphson correction yields more aggressive 
solution. From the simulation results presented in Table [3] APPLE LASSO performs much 
better than GLMNET when CV is applied in both p = and p — 0.5 cases. Also, it is obvious 
that MCP does a better job than LASSO in terms of FP and TP. 

Another interesting observation is the behavior when using different values of 7 for MCP 
in Figure [2(c)| from the simulation results presented in Table [3] neither the selection nor the 
estimation seems to be sensitive to the choice of 7. This shows the stability of MCP in terms 
of the concavity parameter 7. 

5 Applications 

In this section, we present the analysis for two gene expression datasets with large dimension 
p and small sample size n. 

Example 3 (i) We consider the leukemia dataset previously analyzed in \Golub et all \l99§(l . 
There are p — 7, 129 genes and n = 72 samples coming from two classes: ^7 in class ALL 
(acute lymphocytic leukemia) and 25 in class AML (acute myelogenous leukemia), (ii) The 
Neurob lastoma data set, obtained via the MicroArray Quality Control phase-II (MAQC-II) 
vroiect \Consortiun\ \20lA) . consists of gene expression profiles for p = 10, 707 genes from 251 
patients of the German Neuroblastoma Trials NB90-NB2004, diagnosed between 1989 and 2004- 
We analyzed the gene expression data with the 3-year event-free survival (3-year EFS), which 
indicates whether a patient survived 3 years after the diagnosis of neuroblastoma. There are 
n = 239 subjects with the 3-year EFS information available (49 positives and 190 negatives) . 

Potentially, a large number of genes are affected by the two types of leukemia in (i) or 
negative/positive information about 3-year EFS in (ii). In addition, the sample size n is much 
smaller than the dimension p for both problems. Therefore, a regularized logistic regression 
model is suitable. We impose LASSO and MCP penalties to these data sets, and compare the 
prediction accuracy yielded by the APPLE, GLMNET and NCVREG packages, respectively. 

To check the stability of the results, we randomly split the data into training and testing 
sets 5 times for each example, and report the median prediction accuracy on the testing data 
and the median model size. For simplicity, EBIC was used to select the tuning parameter. For 
the MCP case, we fix 7 = 1.3 in both examples, which turned out to have better performance 
than larger 7 values in our simulation results. Notice that in some other high-dimensional 
variable selection literature, a larger 7 was chosen to present the results. But when 7 is too 
large, the MCP solution path has little difference from the LASSO path, as shown in the figures 
of our simulation examples. 

From the results in Table [4j where test error is the number of misclassified subjects out 
of the size of the test dataset, we notice that for LASSO, APPLE leads to a smaller model 
size while having the same test error in both examples when compared with GLMNET. For 
MCP, in the leukemia example, APPLE only needs 1 variable to achieve the same test error 
as LASSO, and a better test error than the NCVREG. For the neuroblastoma example, MCP 
performs very well for both APPLE and NCVREG as compared with the LASSO. 



6 Summary 

In this paper, we propose a new algorithm, APPLE, for calculating the Approximate Path 
for Penalized Likelihood Estimators. The results from the simulation studies and real data 
examples provide compelling evidence that the APPLE algorithm is a worthwhile alternative 
to the existing methods. 
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Table 4: Comparison for APPLE with GLMNET and NCVREG in LASSO and MCP (7 = 1.3), 
respectively. The medians of model size and test error (the ratio of number of wrongly classified 
subjects to size of test dataset) for two real data sets are reported, enclosed in the enclosed in 
parentheses arc the corresponding standard errors. 



Data 


Criteria 


LASSO 
APPLE GLMNET 


MCP (7 = 1.3) 
APPLE NCVREG 


leukemia 


model size 
test error 


11 

4/36 


13 

4/36 


1 

4/36 


3 

5/36 


neuroblastoma 


model size 
test error 


37 

22/123 


44 

22/123 


5 

22/123 


4 

23/123 



APPLE takes significantly less time than NCVREG, and the same order of time as GLM- 
NET. In each step, APPLE only updates the variables in the active set when the current model 
is sparse enough. When the model involves too many noise variables, APPLE switches to a 
coordinate descent correction. 

The 7 adaptation meth od we adopt here is different from the one originally introduced by 
IBrehenv and Huand fcOllf ). It is due to the vector update performed in APPLE. Here, the 
minimum eigenvalue adaptation preserves the minimization of the maximum concavity of the 
MCP penalty while maintaining the stability in the Newton-Raphson update. 

A public domain R language package apple is available from the CRAN website. 



A Logistic Regression 
A.l LASSO 

In logistic regression, we assume (pa, yt), i = 1, • • • ,n are i.i.d. with P(yj = l\xi) = Pi = 
cxp((3'xi)/ (l+exp(/3'x{)). Then the target function for the LASSO penalized logistic regression 
is defined as 



.1* f 
L&) = Y,{Vi(P'*i) - log(l + expOa'asO)} + A V \f3 

The KKT conditions are given as follows. 



j 1 



cxp(/3 Xi ) 
-fcxp(/3 x t ) 



Vi}xij = Asgn(/3j), &^0; 



Er-i{ 1 ; xp( ^'\ -y*Ki<A, 



En 
i=l 1 



cxp(/3 Xi) 
+cxp(,3 Xi) 



We define active set Ak as 
Ak = {j 



1 - 

in *r^i 



En 
i=lVi- 



exp(/3 ( Xi) 
l + exp(/3 x^ 



Pi = 0; 



} Xij I > A fc \ U {0} 
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To update, we define 



( fe ) _ exp (k) Xi) 



l + cxp(/3 Xi) 



r(k) 



diag{ir{ k >(l--jri*0,'" ,^(1-^)} 



(*)/ 



diag{^ fc) (l-^) 



1 + oxp(/3 cci 



(fch 1 - exp(3 (M a:„) 
1 + exp(/3 as„) 



then s w = (0, dW = (M-o )'» where 



(fc)'y 



C (fe) = diag(T w X AfcU0}S ^), 



X A fc \{0}^ (fc) ^A fc \{0} 



Bgn(3i\{0})» 



(fc'K 



x X 



A fc \{0} 
1 



(fc) 



(fc) 



To correct, 



9/3 A , n 



X A k \{0}£ ^A fc \{0} s -0 



8L^ k) 1 / / exp(3 (fc) 'x 



-y) +A fc Bgn(0,3jJ U0} )' ) 

1 + cxp(/3 A ) 



d 2 L^ 
80 



= -A Al .y (fe) A, 



A.2 MCP 

For MCP penalized logistic regression, we define the target function as 

i J\ rW t 

L{(3) = --Y J {yi{0Xi)-\og{l + C x V {{3 l x i ))}+\Y J I ( 1 ~l^)+ dt - 



The KKT conditions are given as follows. 



l+exp(/3 a;;) 
n cxp{/3 ajj) 



Iftl > A 7 . 
ft = o. 



En 
i=l ~ 



+ cxp(/3 



En 
i=l 



For a given A^, define the active set Aj, as 

Afc = {A fc _i U iV fc } \ D h , 
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where 



N k = {j G {!,-•■ ,p}\A k -i 



1 n 



cxp(/3 Xi 



!=1 l + exp(/3a5i) 



2/i}a: i:( | > A fc }, 



and 



D fc = {j e A fe _r n A fc _ 2 : sgn(/3f - 1) )sgn(/3f- 2) ) < o} . 
To perform adaptive rescaling on 7, define 

r = diag{l/ 7 ,-- - ,1/7}. 
To update, the derivatives are defined as follows, 

s<fc o = -(^ X A k \{o}V (k)x A k \{ } - U min rj sgn(3^ uo} ), 



x X 



A fc \{0} S -0> 



,<*0 



X A k \{0}(, 



(k) 



and 



sgn(/3_ ) 



sgn(/3 



To correct we use 



where 



Vsgn(^ )J{|^ |<AW 7 }7 



n i= i 1 + exp(/3 v a5i 



+ A fe sgn(0,/3 



(fe)' 

A k \{0} 



I'fl- 



l3A fc \{0}k 

A fc7 j + ' 



= — X AuV X A k — Wmi n r, 



and 



v=[v 



exp \fi (k) X 
1 + exp (p (kV X 
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B Poisson Regression 
B.l LASSO 

In Poisson regression, we assume (xi, ?/;), i = 1, • • • , n are iid with P(Y = yt) — e~ Xi /(yi)\, 
where log A; = f3'xi. Then criterion for the LASSO penalized Poisson regression is defined as 



n p 

L((3) = - { expGa'xi) - ViiP'xi)} + |ft 



3=1 



The KKT conditions are given as follows. 



k Efai { exp(/3 jeO - = Asgn(ft), / 0. 

ljT £"=i { ex P(3 asi) - I/i}a5ij| < A, ft- = 0. 

A) = log — 



S"=l 



For a given Xk, we define the active set as follows. 

1 n k ' 

A k = {i : I - - exp(3 (fe) «Bi)ar«}| > A fe }u{0}. 



To update, we define 



then 



and 



= diag{exp(/3' ' xi),--- , exp(/3' ' £c„)}, 



;(*)' 



(/=)' 



o (*0 



X A k \{0}V (k) X Ak \ {0} 



To correct, 



d (k) - 



X -4 fc \{0}^ (fe)X A fc \{0} 



A A fc \{0}€ ^A fc \{0) s -0- 



±x' A „ fe X p0 kY X) -y)+ A fc sgn(0,3^' uo} )', 



9/3 2 , n 



x' A v^x Ak . 



B.2 MCP 

For MCP penalized Poisson regression, we define the target function as 

= -J^i^f 3 '^ - y^' x ^} + X J2 / (i-t-)+*- 

n >=i j=i J ° " f 
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The KKT conditions are, 



k EIU { eMP'xi) ' lft}asii = A(l - ^)sgn(ft), < < A 7 . 



£ E7=i { exp(/3 jEi) - yi}x<j = 0, 
l£ ELi { exp(3'a3i) - j/<}««| < A, 
/3o=log— Ski^-. 

For a given Xk, the active set is defined as 

A k = {Ak-iUN k }\D h , 

where 



|ft-|>A 7 - 

h = o. 



and 



1 n 

iV fc = {j €{!,-•• ,p}\ A fe _i : |-53{exp(3'asi)-J/i}iB«| > A fc }, 



C fc = {j G A fe _! n A fe _ 2 : sgn(/3f ~ 1) )sgn(/3f - 2) ) < 0}. 



To update, the derivatives are defined as follows, 



s ( _ fc (] = ( ^X AkX{0} V (K 'X Ak \ {0) - it min r ] sgn(/3X;\ {0} ), 



s(fc) 



j(fe) 



and 



s g n (/3-o) = 



X A fc \{o}5 (fe)x -4 fc \{o}S v _"o 



Wn(3^„ fc )^l3ilj<A (fc) 7}y 



1 71 fe ' 

sgn(4 J - k) ) = { exp (^ a< ) _ yi l a 



To correct we use 



where 



and 



dl,™ _ _1 



+w(o ) 3s; {0}) '(i-^H) +) 

d 2 £ (fc) i v ' vWv r 
-e0r- = n XA > v x *» 



v = 



y - exp /3 A 
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